Aufgabe ist es aus der MassBank (https://massbank.eu/MassBank/) alle bisher hinterlegten Substanzen zu klassifizieren und eine deskriptive Statistik des erhaltenen Datensatzes anzufertigen. Hierzu wird im Folgendem das verwendete R-Script beschrieben. Zunächst wird ein R-Skript mit den benötigten Helfer-Funktionen eingelesen, in der benötigte Funktionen ausgelagert werden, um die Vignette übersichtlicher zu gestalten.

source("../Scripts/viz-helper.R")

Die benötigten Pakete müssen geladen werden:

library(readr)
library(rinchi)
library(magrittr)
library(classyfireR)
library(plyr)
library(dplyr)
library(sunburstR)
library(Rcpp)
library(rcdk)
library(ggplot2)
library(plotly)
library(metfRag)
library(data.table)
library(treemap)

Einlesen der InChiKeys aus der txt-Datei:

my_data <- read.table("../Data/Inchi_massbank.txt")[,3]

Im nächsten Schritt werden die InChiKeys verwendet, um eine Klassifizierung der Substanzen durchzuführen. Die Klassifizierung erfolgt mit dem Paket “classyfireR” (https://github.com/aberHRML/classyfireR). Die Funktion “get_classification” erzeugt ein JSON-Objekt für jede Substanz. Aus diesem Objekt können zum Beispiel Metadaten und die hierarchischen Klassifizierungsstufen ausgelesen werden. Für Substanzen die nicht klassifiziert werden konnten wird ein Objekt vom Typ “NULL” erzeugt.

Classification_List <- sapply(my_data, get_classification)
head(Classification_List,1)
## $`DUBNHZYBDBBJHD-UHFFFAOYSA-L`
## ── ClassyFire Object ───────────────────────────────────── classyfireR v0.3.6 ── 
## Object Size: 11.2 Kb 
##  
## Info: 
## ● InChIKey=DUBNHZYBDBBJHD-UHFFFAOYSA-L
##   
## ● [Zn++].CN(C)C([S-])=S.CN(C)C([S-])=S
##   
## ● Classification Version: 2.1
##   
## kingdom : Organic compounds
## └─superclass : Organic salts
##   └─class : Organic metal salts
##     └─subclass : Organic transition metal salts

Das erhaltene JSON Objekt aus ClassyfierR muss so formatiert werden, dass am Ende ein Datenframe mit der hierarischen Klassifizierung der Substanzen ausgestattet ist. Hierzu wurde eine Funktion classyfire2df() geschrieben, die als Eingabe die Classification_List und als Ausgabe einen Datenframe hat:

class <- vector("character", length(Classification_List[]))
df <- classyfire2df()
df

Anzahl der zu klassifizierenden Substanzen aus der Chemotion Repository (Größe des Datensatzes):

## [1] 16552

In dem erzeugten Dataframe stehen in den Spalten der jeweilige Klassifizierungsgrad und in den Reihen die jeweilige Substanz. Im ersten Schritt wird eine neue Spalte “classes” erzeugt, in der alle Spalten mit den Klassifizierungsgraden einer Substanz zusammengefasst werden und mit einem Semikolon getrennt werden. Im nächsten Schritt werden alle SPalten, außer die “classes”-Spalte gelöscht und die hierarisch angelegten Substanzklassen nach Anzahl aufsummiert. Klassifizierungsgrade, die in ihrem Namen einen Bindestrich haben, werden im folgenden Sunburst-Plot nicht richtig geplotted, deshalb werden die Bindestriche in Unterstriche umgewandelt.

col_names <- rename(df, kingdom=1,superclass=2, class=3, subclass=4, level5=5,level6=6, level7=7,level8=8, level9=9,level10=10)
df <- col_names %>% dplyr::filter(!(kingdom==""))
df$classes <- gsub('; NA','', paste(df$"kingdom",df$"superclass",df$"class", df$"subclass", df$"level5", df$"level6", df$"level7",df$"level8" ,df$"level9",df$"level10", sep = "; "))
df <-ddply(df,.(classes),summarize, size=length(classes) )

Anzahl der erfolgreich klassifizierten Substanzen, sowie der nicht klassifizierbaren Substanzen aus der Chemotion Repository:

## [1] 16536
## [1] 16

Mithilfe dieses Data Frames kann die Darstellung des Klassifizierungslevel “class” aller klassifizierten Substanzen prozentual als Sunburst-Plot erfolgen. Die Erstellung des Sunburst Plots erfolgte mit dem Paket “sunburstR” (https://github.com/timelyportfolio/sunburstR)

d3_tree <- sunburstR:::csv_to_hier(df,delim = ";")
sunburst(d3_tree, legend = FALSE,colors=c("#ff4040","#ff423d","#ff453a","#ff4838","#fe4b35","#fe4e33","#fe5130","#fd542e","#fd572b","#fc5a29","#fb5d27","#fa6025","#f96322","#f96620","#f7691e","#f66c1c","#f56f1a","#f47218","#f37517","#f17815","#f07c13","#ee7f11","#ed8210","#eb850e","#e9880d","#e88b0c","#e68e0a","#e49209","#e29508","#e09807","#de9b06","#dc9e05","#d9a104","#d7a403","#d5a703","#d2aa02","#d0ad02","#ceb001","#cbb301","#c9b600","#c6b800","#c3bb00","#c1be00","#bec100","#bbc300","#b8c600","#b6c900","#b3cb01","#b0ce01","#add002","#aad202","#a7d503","#a4d703","#a1d904","#9edc05","#9bde06","#98e007","#95e208","#92e409","#8ee60a","#8be80c","#88e90d","#85eb0e","#82ed10","#7fee11","#7cf013","#78f115","#75f317","#72f418","#6ff51a","#6cf61c","#69f71e","#66f920","#63f922","#60fa25","#5dfb27","#5afc29","#57fd2b","#54fd2e","#51fe30","#4efe33","#4bfe35","#48ff38","#45ff3a","#42ff3d","#40ff40","#3dff42","#3aff45","#38ff48","#35fe4b","#33fe4e","#30fe51","#2efd54","#2bfd57","#29fc5a","#27fb5d","#25fa60","#22f963","#20f966","#1ef769","#1cf66c","#1af56f","#18f472","#17f375","#15f178","#13f07c","#11ee7f","#10ed82","#0eeb85","#0de988","#0ce88b","#0ae68e","#09e492","#08e295","#07e098","#06de9b","#05dc9e","#04d9a1","#03d7a4","#03d5a7","#02d2aa","#02d0ad","#01ceb0","#01cbb3","#00c9b6","#00c6b8","#00c3bb","#00c1be","#00bec1","#00bbc3","#00b8c6","#00b6c9","#01b3cb","#01b0ce","#02add0","#02aad2","#03a7d5","#03a4d7","#04a1d9","#059edc","#069bde","#0798e0","#0895e2","#0992e4","#0a8ee6","#0c8be8","#0d88e9","#0e85eb","#1082ed","#117fee","#137cf0","#1578f1","#1775f3","#1872f4","#1a6ff5","#1c6cf6","#1e69f7","#2066f9","#2263f9","#2560fa","#275dfb","#295afc","#2b57fd","#2e54fd","#3051fe","#334efe","#354bfe","#3848ff","#3a45ff","#3d42ff","#4040ff","#423dff","#453aff","#4838ff","#4b35fe","#4e33fe","#5130fe","#542efd","#572bfd","#5a29fc","#5d27fb","#6025fa","#6322f9","#6620f9","#691ef7","#6c1cf6","#6f1af5","#7218f4","#7517f3","#7815f1","#7c13f0","#7f11ee","#8210ed","#850eeb","#880de9","#8b0ce8","#8e0ae6","#9209e4","#9508e2","#9807e0","#9b06de","#9e05dc","#a104d9","#a403d7","#a703d5","#aa02d2","#ad02d0","#b001ce","#b301cb","#b600c9","#b800c6","#bb00c3","#be00c1","#c100be","#c300bb","#c600b8","#c900b6","#cb01b3","#ce01b0","#d002ad","#d202aa","#d503a7","#d703a4","#d904a1","#dc059e","#de069b","#e00798","#e20895","#e40992","#e60a8e","#e80c8b","#e90d88","#eb0e85","#ed1082","#ee117f","#f0137c","#f11578","#f31775","#f41872","#f51a6f","#f61c6c","#f71e69","#f92066","#f92263","#fa2560","#fb275d","#fc295a","#fd2b57","#fd2e54","#fe3051","#fe334e","#fe354b","#ff3848","#ff3a45","#ff3d42","#ff4040"))
Legend
sunburst(d3_tree, legend = FALSE)
Legend
sunburst(d3_tree, legend = FALSE, colors = c("#23171b","#271a28","#2b1c33","#2f1e3f","#32204a","#362354","#39255f","#3b2768","#3e2a72","#402c7b","#422f83","#44318b","#453493","#46369b","#4839a2","#493ca8","#493eaf","#4a41b5","#4a44bb","#4b46c0","#4b49c5","#4b4cca","#4b4ecf","#4b51d3","#4a54d7","#4a56db","#4959de","#495ce2","#485fe5","#4761e7","#4664ea","#4567ec","#446aee","#446df0","#426ff2","#4172f3","#4075f5","#3f78f6","#3e7af7","#3d7df7","#3c80f8","#3a83f9","#3985f9","#3888f9","#378bf9","#368df9","#3590f8","#3393f8","#3295f7","#3198f7","#309bf6","#2f9df5","#2ea0f4","#2da2f3","#2ca5f1","#2ba7f0","#2aaaef","#2aaced","#29afec","#28b1ea","#28b4e8","#27b6e6","#27b8e5","#26bbe3","#26bde1","#26bfdf","#25c1dc","#25c3da","#25c6d8","#25c8d6","#25cad3","#25ccd1","#25cecf","#26d0cc","#26d2ca","#26d4c8","#27d6c5","#27d8c3","#28d9c0","#29dbbe","#29ddbb","#2adfb8","#2be0b6","#2ce2b3","#2de3b1","#2ee5ae","#30e6ac","#31e8a9","#32e9a6","#34eba4","#35eca1","#37ed9f","#39ef9c","#3af09a","#3cf197","#3ef295","#40f392","#42f490","#44f58d","#46f68b","#48f788","#4af786","#4df884","#4ff981","#51fa7f","#54fa7d","#56fb7a","#59fb78","#5cfc76","#5efc74","#61fd71","#64fd6f","#66fd6d","#69fd6b","#6cfd69","#6ffe67","#72fe65","#75fe63","#78fe61","#7bfe5f","#7efd5d","#81fd5c","#84fd5a","#87fd58","#8afc56","#8dfc55","#90fb53","#93fb51","#96fa50","#99fa4e","#9cf94d","#9ff84b","#a2f84a","#a6f748","#a9f647","#acf546","#aff444","#b2f343","#b5f242","#b8f141","#bbf03f","#beef3e","#c1ed3d","#c3ec3c","#c6eb3b","#c9e93a","#cce839","#cfe738","#d1e537","#d4e336","#d7e235","#d9e034","#dcdf33","#dedd32","#e0db32","#e3d931","#e5d730","#e7d52f","#e9d42f","#ecd22e","#eed02d","#f0ce2c","#f1cb2c","#f3c92b","#f5c72b","#f7c52a","#f8c329","#fac029","#fbbe28","#fdbc28","#feb927","#ffb727","#ffb526","#ffb226","#ffb025","#ffad25","#ffab24","#ffa824","#ffa623","#ffa323","#ffa022","#ff9e22","#ff9b21","#ff9921","#ff9621","#ff9320","#ff9020","#ff8e1f","#ff8b1f","#ff881e","#ff851e","#ff831d","#ff801d","#ff7d1d","#ff7a1c","#ff781c","#ff751b","#ff721b","#ff6f1a","#fd6c1a","#fc6a19","#fa6719","#f96418","#f76118","#f65f18","#f45c17","#f25916","#f05716","#ee5415","#ec5115","#ea4f14","#e84c14","#e64913","#e44713","#e24412","#df4212","#dd3f11","#da3d10","#d83a10","#d5380f","#d3360f","#d0330e","#ce310d","#cb2f0d","#c92d0c","#c62a0b","#c3280b","#c1260a","#be2409","#bb2309","#b92108","#b61f07","#b41d07","#b11b06","#af1a05","#ac1805","#aa1704","#a81604","#a51403","#a31302","#a11202","#9f1101","#9d1000","#9b0f00","#9a0e00","#980e00","#960d00","#950c00","#940c00","#930c00","#920c00","#910b00","#910c00","#900c00","#900c00","#900c00"))
Legend
sunburst(d3_tree, legend = FALSE, colors = c("#e7f0fa","#e6f0f9","#e5eff9","#e4eff9","#e3eef9","#e3eef8","#e2edf8","#e1edf8","#e0ecf8","#e0ecf7","#ddeaf7","#ddeaf6","#dce9f6","#dbe9f6","#dae8f6","#d9e8f5","#d9e7f5","#d8e7f5","#d7e6f5","#d6e6f4","#d6e5f4","#d5e5f4","#d2e3f3","#d2e3f3","#d1e2f3","#d0e2f2","#cfe1f2","#cee1f2","#cde0f1","#cce0f1","#ccdff1","#cbdff1","#cadef0","#c9def0","#c8ddf0","#c7ddef","#c6dcef","#c5dcef","#c4dbee","#c3dbee","#c2daee","#c1daed","#c0d9ed","#bfd9ec","#bed8ec","#bdd8ec","#bcd7eb","#bbd7eb","#b6d4e9","#b5d4e9","#b4d3e9","#b2d3e8","#b1d2e8","#b0d1e7","#afd1e7","#add0e7","#acd0e6","#abcfe6","#a9cfe5","#a8cee5","#a7cde5","#a5cde4","#a4cce4","#9ec9e2","#9dc9e2","#9cc8e1","#9ac7e1","#99c6e1","#97c6e0","#96c5e0","#94c4df","#93c3df","#91c3df","#90c2de","#8ec1de","#8dc0de","#8bc0dd","#8abfdd","#88bedc","#87bddc","#85bcdc","#84bbdb","#82bbdb","#81badb","#7fb9da","#7eb8da","#7cb7d9","#7bb6d9","#79b5d9","#78b5d8","#72b1d7","#70b0d6","#6fafd6","#6daed5","#6caed5","#6badd5","#69acd4","#68abd4","#66aad3","#65a9d3","#63a8d2","#62a7d2","#61a7d1","#5fa6d1","#5ea5d0","#5da4d0","#5ba3d0","#5aa2cf","#59a1cf","#57a0ce","#569fce","#559ecd","#549ecd","#529dcc","#519ccc","#509bcb","#4f9acb","#4d99ca","#4c98ca","#4b97c9","#4a96c9","#4895c8","#4794c8","#4693c7","#4592c7","#3f8ec4","#3e8dc3","#3d8cc3","#3c8bc2","#3b8ac2","#3a89c1","#3988c1","#3787c0","#3686c0","#3585bf","#3484bf","#3383be","#3282bd","#3181bd","#3080bc","#2f7fbc","#2e7ebb","#2d7dbb","#2878b8","#2777b7","#2676b6","#2574b6","#2473b5","#2372b4","#2371b4","#2270b3","#216fb3","#206eb2","#1f6db1","#1e6cb0","#1d6bb0","#1c6aaf","#1c69ae","#1b68ae","#1a67ad","#1966ac","#1865ab","#1864aa","#1763aa","#1662a9","#1561a8","#1560a7","#145fa6","#135ea5","#135da4","#125ca4","#115ba3","#115aa2","#1059a1","#1058a0","#0f579f","#0e569e","#0e559d","#0e549c","#0d539a","#0d5299","#0b4d93","#0b4c92","#0a4b91","#0a4a90","#0a498e","#0a488d","#09478c","#09468a","#094589","#094487","#094386","#094285","#094183","#084082","#083e80","#083d7f","#083c7d","#083b7c","#083a7a","#083979","#083877","#083776","#083674","#083573"))
Legend

Darstellung der Substanzklasse “superclass” als Treemap:

tm <- treemap(
  tm_func(),
  index = c("2"),
  vSize="size",
  type = "value",
  title = "Treemap of the superclasses in the dataset")

tm2 <- treemap(
  tm_func(),
  index = c("3"),
  vSize="size",
  type = "value",
  title = "Treemap of the classes in the dataset")

tm3 <- treemap(
  tm_func(),
  index = c("2","3"),
  vSize="size",
  type = "value",
  title = "Treemap of the superclasses and classes in the dataset")

Darstellung der superclasses als Barplot:

bar <- ggplot(bar_func(),aes(size,reorder(classes, size, sum)))+
  geom_col(fill="darkblue")+
  labs(y= "superclasses", x = "count")
ggplotly(bar)

Sessioninfo:

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8          LC_NUMERIC=C                 
##  [3] LC_TIME=de_DE.UTF-8           LC_COLLATE=de_DE.UTF-8       
##  [5] LC_MONETARY=de_DE.UTF-8       LC_MESSAGES=de_DE.UTF-8      
##  [7] LC_PAPER=de_DE.UTF-8          LC_NAME=de_DE.UTF-8          
##  [9] LC_ADDRESS=de_DE.UTF-8        LC_TELEPHONE=de_DE.UTF-8     
## [11] LC_MEASUREMENT=de_DE.UTF-8    LC_IDENTIFICATION=de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] treemap_2.4-2     data.table_1.14.0 metfRag_2.4.2     fingerprint_3.5.7
##  [5] rjson_0.2.20      plotly_4.9.3      ggplot2_3.3.3     rcdk_3.5.0       
##  [9] rcdklibs_2.3      rJava_0.9-13      Rcpp_1.0.6        sunburstR_2.1.5  
## [13] dplyr_1.0.5       plyr_1.8.6        classyfireR_0.3.6 magrittr_2.0.1   
## [17] rinchi_0.5        readr_1.4.0      
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2         sass_0.3.1         tidyr_1.1.3        jsonlite_1.7.2    
##  [5] viridisLite_0.3.0  bslib_0.2.4        shiny_1.6.0        assertthat_0.2.1  
##  [9] highr_0.8          yaml_2.2.1         pillar_1.5.1       glue_1.4.2        
## [13] digest_0.6.27      RColorBrewer_1.1-2 promises_1.2.0.1   colorspace_2.0-0  
## [17] htmltools_0.5.1.1  httpuv_1.6.0       clisymbols_1.2.0   pkgconfig_2.0.3   
## [21] purrr_0.3.4        xtable_1.8-4       scales_1.1.1       itertools_0.1-3   
## [25] later_1.1.0.1      tibble_3.1.0       generics_0.1.0     ellipsis_0.3.1    
## [29] withr_2.4.1        lazyeval_0.2.2     cli_2.3.1          crayon_1.4.1      
## [33] d3r_0.9.1          mime_0.10          evaluate_0.14      fansi_0.4.2       
## [37] tools_4.0.5        hms_1.0.0          lifecycle_1.0.0    gridBase_0.4-7    
## [41] stringr_1.4.0      munsell_0.5.0      compiler_4.0.5     jquerylib_0.1.4   
## [45] rlang_0.4.10       grid_4.0.5         iterators_1.0.13   rstudioapi_0.13   
## [49] squash_1.0.9       htmlwidgets_1.5.3  crosstalk_1.1.1    igraph_1.2.6      
## [53] labeling_0.4.2     rmarkdown_2.7      gtable_0.3.0       DBI_1.1.1         
## [57] curl_4.3           R6_2.5.0           knitr_1.31         fastmap_1.1.0     
## [61] utf8_1.2.1         stringi_1.5.3      parallel_4.0.5     vctrs_0.3.6       
## [65] png_0.1-7          tidyselect_1.1.0   xfun_0.22